Fix DeployBot recovery deadlock from stale re-pause by mberman84 · Pull Request #48 · Forward-Future/DeployBot

mberman84 · 2026-06-22T18:50:41Z

Problem

A failed exact-main release that an operator explicitly unpaused could never land its repair, deadlocking delivery.

Reproduction (from the report):

CI failed for main 79cbedf6….
Repair PR #970 was queued and fully green at 86152d68….
deploybot unpause created a running recovery control.
The next DeployBot run ignored that recovery, saw the old failed CI, returned release-held, and wrote a new ci-failed pause.
The repair could never merge, because merging it was the only way to produce a new passing main CI run.

The root cause is that the original CI failure lingers on the failed SHA until the repair merges. Any coordinator run — especially a workflow pinned to an older DeployBot release — rereads that same result, writes a fresh, byte-identical pause, and overwrites the running recovery. The release-admission fence also keeps holding that failed SHA, so the repair never drains.

Fix

Three coupled changes break the deadlock and make recovery robust against stale reconciliation:

Preserve the recovery (records reconciliation). A recovery now remembers the exact SHA and reason it resumed. An unconditional pause that merely restates that already-recovered failure is ignored, so a lagging or older-version worker can no longer clobber the recovery. A genuinely new failure (a different SHA, or a different reason such as a later deploy failure) still pauses normally, preserving concurrent-pause ownership.
Merge the repair (reactor admission fence). While a recovery owns the current failed main, the release-admission fence no longer holds that SHA. The elected repair drains and advances main past the failed revision, and the new main is then followed and verified normally.
Wake immediately (unpause). After recording the durable recovery, DeployBot reacts right away so the repair merges without waiting for the next delivery event or the five-minute reconciliation sweep. --no-wake opts out, and --follow/--dispatch-ci/--timeout shape that wake-up reaction. The recovery is already durable, so a transient wake-up error is reported but never re-pauses the pipeline.

Tests

New regressions cover the stale re-pause race, a genuinely new failure still pausing, the reactor merging a repair during recovery (and the bypass staying scoped to the exact unpaused SHA), and the unpause wake (including opt-out and durable-recovery-survives-wake-failure).

All existing behavior is preserved. CI parity verified locally: ruff check src tests, python -m unittest discover -s tests (251 tests OK), and python -m build all pass.

Note on versioning

The runtime pin (RELEASE_COMMIT / vX.Y.Z references in README, action, and client configs) points at a published release commit, so it is intentionally left to the separate release-pin step once this fix has a merge commit. The code fix itself is version-independent and robust even during a rolling upgrade.

A failed exact-main release that an operator explicitly unpaused could never land its repair. The original CI failure lingers on the failed SHA until the repair merges, so any coordinator run (especially a workflow pinned to an older release) rereads that result, writes a fresh, byte identical pause, and overwrites the running recovery. The repair then sits behind a release that can only turn green once the repair merges. Three coupled fixes break the deadlock: - records.latest_control: a recovery now carries the exact SHA and reason it resumed, and an unconditional pause that merely restates that same already-recovered failure is ignored. A genuinely new failure (a different SHA, or a different reason such as a later deploy failure) still pauses normally, so concurrent-pause ownership is preserved. - command_react: while a recovery owns the current failed main, the release-admission fence no longer holds that SHA. The elected repair drains and advances main past the failed revision; the new main is then followed and verified normally. - command_unpause: after recording the durable recovery, DeployBot reacts immediately so the repair merges without waiting for the next delivery event or the five-minute reconciliation sweep. --no-wake opts out, and --follow/--dispatch-ci/--timeout shape the wake-up reaction. The recovery is durable, so a transient wake-up error is reported but never re-pauses the pipeline. Adds regressions covering the stale re-pause race, the reactor merging a repair during recovery, the scoping of that bypass, and the unpause wake. Co-authored-by: mberman84 <mberman84@users.noreply.github.com>

cursor Bot merged commit d615630 into main Jun 22, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix DeployBot recovery deadlock from stale re-pause#48

Fix DeployBot recovery deadlock from stale re-pause#48
cursor[bot] merged 1 commit into
mainfrom
cursor/fix-recovery-deadlock-stale-repause-fc6d

mberman84 commented Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mberman84 commented Jun 22, 2026

Problem

Fix

Tests

Note on versioning

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants